Abstract

Directed information theory deals with communication channels with feedback. When applied 
to networks, a natural extension based on causal conditioning is needed. We show here that 
measures built from directed information theory in networks can be used to assess Granger causality 
graphs of stochastic processes. We show that directed information theory includes measures such 
as the transfer entropy, and that it is the adequate information theoretic framework needed for 
neuroscience applications, such as connectivity inference problems. 
I. INTRODUCTION



Modeling and estimating connectivity is a key question often raised in neuroscience. 
Understanding connectivity is fundamental in order to decipher how neural networks process
information. Deriving a definition for connectivity turns out to be a problem. In [46], three
types of connectivities are described: structural or anatomical connectivity describes the 
physical links between parts of the brain; functional connectivity describes links between 
parts of the brain that jointly react in some circumstances (the joint reaction is reflected by 
measures such as correlation or mutual information); effective connectivity is an attempt to 
add to functional connectivity the notion of direction in the information flow. Once a point 
of view is adopted, the inference problem, i.e. estimating the connectivity from data, gives
rise to numerous difficulties. For instance, in measuring effective connectivity, the different 
scales of observation of the brain (associated with different means of observation) lead to 
time series that may have very different natures and properties, and thus may lead to rather 
different conclusions. When studying, for example, networks of neurons cultured in vitro 
and recorded by Micro-Electrode Arrays, the recorded signals will usually be described as a 
mixture of point processes and continuously valued processes. Depending on the nature of 
the experiment, the correlation structure of the signals may depict short or long memory, 
leading to different processing schemes. Furthermore, approaches will be in general highly 
nonlinear. Going to a much broader scale, fMRI measurements are well modeled by Gaussian 
processes but with long range memory. These facts lead to the conclusion that there is no 
universal method for inferring a graph from multiple measurements that will reflect the 
connectivity of the brain. However, general principles may be designed and adapted to each 
situation. It is the goal of this short paper to offer such a general framework — one that relies 
on information theory and causality principles. 

Dependence analysis will provide the main tools for inferring connectivity. Such tools
range from correlation and partial correlation to mutual information and causality measures.
Many of the most popular tools are non directional, e.g. correlation or partial
correlation, and mutual information measures. These measures have been extensively used
in neuroscience (e.g. [29], to cite but a few).



Alternately, some authors have defined directional measures. Some of these generalize
partial correlation to partial directed coherence in order to have efficient second-order
statistical measures. Other methods and measures have been developed using information
theory [43]. Among these measures, the most popular one, the transfer
entropy, is often cited in neuroscience. It has been applied, for example, in [33] to measure
information flow in sensorimotor networks. Transfer entropy relies, by construction, on
bivariate analysis. One attempt to generalize it to multivariate analysis has been suggested
in {if]]. Although not designed for solving neuroscience problems, this method uses a very
interesting and pragmatic approach. We will discuss this in the last section.

A different class of approaches relies on work by Wiener and Granger on causality.
Granger causality considers that a signal $x_t$ causes a signal $y_t$ if the prediction of $y_t$ is
improved when taking into account the past of $x_t$. This approach is appealing but gives rise
to many questions, philosophical as well as technical [4, 11]. Several levels of
definition for Granger causality exist. If the definition based on linear prediction is adopted,
operational approaches exist to assess causality between signals. These approaches and
some 'linear-in-the-parameters' nonlinear extensions have been applied in neuroscience (e.g.
[14, 44, 45]). Interestingly, applying Granger causality definitions within a linear modeling
framework turns out to introduce measures mostly used in correlation based approaches
(directed partial coherence). This opens a way to unify the different points of view.

The goal of the paper is to propose a possible unification between Granger causality and
information theory. This is made possible by resorting to the framework of directed
information theory.

'Directed information theory' has its roots in Marko's work; Marko was a German ethologist
who studied communication between monkeys in the 1970's [34]. Marko remarked that
standard information theory was not adequate in the context he studied, since feedback
was not taken into account by symmetrical quantities such as the mutual information. He
thus introduced directed information measures elaborated from Markov modeling of
communication signals. His findings were later (re)formalized by Massey in 1990, developed by
Kramer, Tatikonda and some others in the late 1990's, and more recently in [28, 49].
All these results and developments may be referred to as directed information theory, and
culminate in the study of communication theory through channels with feedback. Here, we
do not consider the problem of communication in its full generality, but rather we consider
directed information theory to assess directional dependencies between multiple time series.
The paper is organized as follows: Granger causality graphs, as defined by the work in
[13], are introduced in the next section. Then, we present the essentials of directed
information theory, with emphasis on the notion of causal conditioning. Causal conditioning is
fundamental to assess directional dependence between multiple time series. While extending
these tools for stochastic processes, we will highlight the relationships between transfer
entropy and directed information theory [2, 3]. Section IV is dedicated to establishing the
link between Granger causality graphs and directed information theory. This is one of the 
main points made in this paper. Although the paper remains deliberately at the conceptual 
level, some practical aspects such as estimation issues or testing are discussed in the last 
section. 



II. GRANGER CAUSALITY GRAPHS 



Graphical modeling is a powerful statistical method to model the dependence structure
of multivariate random variables [30, 52]. Graphical models have been extended to random
processes in the nineties [12], and the learning of graphical models has subsequently
been studied, e.g. [13]. It is worth noting that one of the first applications was
dedicated to neuroscience [12]. In [13], the concept of (linear) causality graph is introduced.
Such a graph is a mixed graph in which nodes may be connected by directed edges as well
as undirected edges. Each connection is defined using the concept of Granger causality,
restricted to linear models. Later, [11] generalized the definition of connection using the
unrestricted Granger causality definition, i.e. based on probability measures.



A. Granger causality 

In this section we briefly review the basics concerning Granger causality between two time
series. Granger causality is based upon prediction theory. Let $x_t$ and $y_t$ be two stochastic
processes indexed by $\mathbb{Z}$, the set of integers. Let $x_{n:t}$ be the vector composed of all
the samples of $x$ from time $n$ up to time $t$, or $x_{n:t} = (x_n, x_{n+1}, \ldots, x_{t-1}, x_t)$. $n$ may be equal
to 1, in which case $x_{1:t}$ represents the whole past and the present of process $x$ at time $t$.
We set the origin of time to $t = 1$ for the sake of mathematical convenience. Once all the
measures are defined, we implicitly let the time origin go to $-\infty$.

Let capital letters denote multivariate processes, $X_t = (x_{1,t}, \ldots, x_{M,t})$. As above, $X_{n:t}$
will denote the collection of all the samples of the multivariate time series from time $n$ up
to time $t$.

Basically, a signal $x_t$ will be said to 'Granger cause' a signal $y_t$ if the prediction of $y_t$
is improved when considering not only its own past but also the past of $x_t$. Thus a first
definition can be given using (conditional) probability measures $P$ of the processes: $x_t$ does
not cause $y_t$ if and only if $P(y_t \mid y_{1:t-1}, x_{1:t-1}) = P(y_t \mid y_{1:t-1})$. In other words, $x_t$ does not
cause $y_t$ if $y_t$ is, conditionally on its own past, independent from the past of $x_t$; the chain
$x_{1:t-1} \to y_{1:t-1} \to y_t$ is a Markov chain.

This definition may be satisfactory only if other observations are not taken into account.
Actually, Granger noted that adding new observations may change the
causality relation between two processes, i.e.

$$P(y_t \mid y_{1:t-1}, x_{1:t-1}) \neq P(y_t \mid y_{1:t-1}) \;\not\Rightarrow\; P(y_t \mid y_{1:t-1}, x_{1:t-1}, Z_{1:t}) \neq P(y_t \mid y_{1:t-1}, Z_{1:t}). \qquad (1)$$

The dependence relationship between two time series $x$ and $y$ is not guaranteed to be
conserved when extra observations are taken into account. This means that Granger causality
can only be considered as a property relative to the available information set.

A very simple example to illustrate this can easily be constructed. Let $x_t = a z_{t-1} + \varepsilon_t$,
$y_t = b x_{t-1} + \varphi_t$ and $z_t = c y_{t-1} + \eta_t$ be three processes constructed from three independent
processes $\varepsilon, \varphi, \eta$. Then $P(x_t \mid x_{1:t-1}, y_{1:t-1}) \neq P(x_t \mid x_{1:t-1})$ whereas $P(x_t \mid x_{1:t-1}, y_{1:t-1}, z_{1:t}) =
P(x_t \mid x_{1:t-1}, z_{1:t})$. From this example, we may conclude that a relationship exists between $y$
and $x$ if $z$ is not taken into account. If the observation of the third signal $z$ is considered as
well, no direct link from $y$ to $x$ is exhibited, as all dependencies between $y$ and $x$ appear to
be related to the presence of $z$; including $z$ in the analysis, $y$ is found to not Granger cause
$x$.
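The example above can be checked numerically. The sketch below is only illustrative and is not part of the original analysis: it assumes Gaussian noises and a linear predictor, so that conditional dependence shows up as a drop in least-squares prediction error. The coefficient values (a = b = c = 0.8), the lag order p = 3, the sample size, the seed and the helper names (`simulate`, `past`, `residual_variance`, `gain_from_y`) are arbitrary choices made for the illustration. For simplicity, the conditioning on z uses only its past samples.

```python
import numpy as np

def simulate(n, a=0.8, b=0.8, c=0.8, seed=0):
    """x_t = a z_{t-1} + eps_t, y_t = b x_{t-1} + phi_t, z_t = c y_{t-1} + eta_t."""
    rng = np.random.default_rng(seed)
    eps, phi, eta = rng.standard_normal((3, n))  # three independent white noises
    x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
    for t in range(1, n):
        x[t] = a * z[t - 1] + eps[t]
        y[t] = b * x[t - 1] + phi[t]
        z[t] = c * y[t - 1] + eta[t]
    return x, y, z

def past(s, p):
    """Matrix whose rows are (s[t-1], ..., s[t-p]) for t = p, ..., n-1."""
    n = len(s)
    return np.column_stack([s[p - k:n - k] for k in range(1, p + 1)])

def residual_variance(target, blocks):
    """Mean squared error of the least-squares prediction of target from blocks."""
    design = np.column_stack([np.ones(len(target))] + blocks)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.mean((target - design @ coef) ** 2)

def gain_from_y(x, y, z=None, p=3):
    """0.5 * log ratio of the prediction errors of x_t without and with the past of y."""
    target = x[p:]
    base = [past(x, p)] + ([past(z, p)] if z is not None else [])
    restricted = residual_variance(target, base)
    full = residual_variance(target, base + [past(y, p)])
    return 0.5 * np.log(restricted / full)

x, y, z = simulate(20000)
print("z ignored: ", gain_from_y(x, y))      # past of y helps to predict x_t
print("z included:", gain_from_y(x, y, z))   # past of y brings almost nothing
```

With z ignored the gain is clearly positive; once the past of z enters the predictor it drops to nearly zero, in line with the conclusion that y does not Granger cause x relative to z.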

Granger causality is thus mainly due to the influence of the past of a process onto the
present of another process. Geweke [17] introduced the definition of instantaneous coupling.
If the dynamical noises $\varepsilon_t, \varphi_t, \eta_t$ in the preceding example are assumed to be white but no
longer independent processes, there is a coupling between $x_t$, $y_t$ and $z_t$ which is instantaneous
(Eichler uses the word contemporaneous). Thus two types of influence have to be defined.
Let $x_t$ and $y_t$ be two stochastic processes, and $Z_t$ a third multivariate process which does
not contain $x$ nor $y$ as components.

1. $x_t$ does not cause $y_t$ relatively to $Z_t$ $\iff$ $P(y_t \mid y_{1:t-1}, x_{1:t-1}, Z_{1:t}) = P(y_t \mid y_{1:t-1}, Z_{1:t})$,
$\forall t > 1$

2. $x_t$ does not instantaneously cause $y_t$ relatively to $Z_t$ $\iff$ $P(y_t \mid y_{1:t-1}, x_{1:t}, Z_{1:t}) =
P(y_t \mid y_{1:t-1}, x_{1:t-1}, Z_{1:t})$, $\forall t > 1$.

The absence of a causal relation from $x_t$ to $y_t$ corresponds to the independence between the
present of $y$ and the past of $x$, conditionally on the past of $y$ and the extra information ($Z_{1:t}$).
Further, the lack of instantaneous causality is symmetrical with respect to $x$ and $y$, since it
simply states that $x$ and $y$ at time $t$ are independent conditionally on their joint past and
on the past of $Z$.

These definitions enable us to construct a graph from a multivariate time series as follows
[13]. Each time series is associated to a node. Two types of edges may exist between
two nodes. A directed edge from node $x$ to node $y$ will mean that $x$ Granger causes $y$ with
respect to the remaining time series, and an undirected edge between $x$ and $y$ will mean
that $x$ instantaneously causes $y$ with respect to the other observed time series, stacked in $Z$.
The undirected nature of the latter edge is a consequence of the symmetry of instantaneous
causality. Precisely, let $X_t$ be an $M$-dimensional time series, whose components are denoted
as $x_{i,t}$, $i = 1, \ldots, M$. Let $(V, E_d, E_u)$ be the associated mixed graph, where $V$ is the vertex or
node set, $E_d$ is the set of directed edges and $E_u$ is the set of undirected edges. The cardinal
of $V$ is $M$. The vertices in $V$ are labelled by $i = 1, \ldots, M$, and vertex $i$ will correspond to
process $x_i$ unambiguously. Then, the edge sets are defined via

1. $\forall i \in V, j \in V$, $(i,j) \notin E_d \iff x_{i,t}$ does not cause $x_{j,t}$ relatively to $X\setminus\{x_i, x_j\}_t$

2. $\forall i \in V, j \in V$, $(i,j) \notin E_u \iff x_{i,t}$ does not instantaneously cause $x_{j,t}$ relatively to
$X\setminus\{x_i, x_j\}_t$

where $X\setminus\{x_i, x_j\}_t$ is the $(M-2)$-dimensional process constructed from $X_t$ by deleting
components $i$ and $j$. $(V, E_d, E_u)$ defines a Granger causality graph.
III. DIRECTED INFORMATION THEORY



This section reviews the essential tools from directed information theory, but not from 
a communication theory point of view. Our purpose is instead to recast some results and 
definitions within the framework of dependence analysis between stochastic processes. The 
link between directed information measures and Granger causality graph will be developed 
in the next paragraph. 



A. Directional dependence between two stochastic processes 

For the sake of readability, this paragraph focuses upon studying the relations that may
occur between two processes only, namely $x$ and $y$. The role played by the existence of other
observed processes, outlined previously, and the importance of accounting for such 'extra
information' is deferred to a later discussion.

From a probabilistic point of view, this dependence structure is encoded in the joint
probability measures $P(x_{n_1}, \ldots, x_{n_N}; y_{n_1}, \ldots, y_{n_N})$ for all $N$ and all times $n_1, \ldots, n_N$ in $\mathbb{Z}$. To
introduce the different definitions, we restrict the presentation to the dependence between
vectors constructed from the time series, i.e. $x_{1:t}$. The extension to stochastic processes
is discussed in section III C. Furthermore, we assume in the sequel that the measures are
absolutely continuous with respect to Lebesgue measure, and we will work with probability
density functions.

If there is no dependence structure, or if the processes are independent, it is well known
that the joint probability density functions factorize into $p(x_{n_1}, \ldots, x_{n_N}) \times p(y_{n_1}, \ldots, y_{n_N})$.
Consider the Kullback-Leibler divergence $D_{KL}(f \| g) = E_f[\log f(x)/g(x)]$, where $E_f[\cdot]$ is the
expectation operator (or ensemble average) with respect to the probability density function
$f$. The Kullback-Leibler divergence measures the information lost when wrongly assuming
that a random variable is distributed from $g$ when it is in fact distributed from $f$. Choosing
for $f$ the joint probability density function between two processes, and for $g$ the product of
the marginals then leads to a measure of independence, the well-known mutual information

$$I(x_{1:t}; y_{1:t}) = E\left[\log \frac{p(x_{1:t}, y_{1:t})}{p(x_{1:t})\, p(y_{1:t})}\right] \qquad (2)$$

Mutual information is a nonnegative quantity (which is a property inherited from the
Kullback-Leibler divergence) and is zero if and only if the two processes are independent [9].



However it suffers from being symmetrical with respect to x and y and consequently it is 
useless when it comes to measuring directionality in the dependence structure. 

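This symmetrical behavior appears to be closely related to the symmetry of the factorization
of the joint probability density function $p(x_{1:t}, y_{1:t}) = p(x_{1:t})\, p(y_{1:t})$ under the hypothesis
that the processes are independent. Alternately, the following factorization is introduced:

$$p(x_{1:t}, y_{1:t}) = \overleftarrow{p}(x_{1:t} \mid y_{1:t})\, \overrightarrow{p}(y_{1:t} \mid x_{1:t}) \qquad (3)$$

$$\overleftarrow{p}(x_{1:t} \mid y_{1:t}) = \prod_{i=1}^{t} p(x_i \mid x_{1:i-1}, y_{1:i-1}) \qquad (4)$$

$$\overrightarrow{p}(y_{1:t} \mid x_{1:t}) = \prod_{i=1}^{t} p(y_i \mid x_{1:i}, y_{1:i-1}). \qquad (5)$$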

If we consider the link between $x$ and $y$ as a channel with input $x$ and output $y$, the term
$\overrightarrow{p}(y_{1:t} \mid x_{1:t})$ describes the feedforward link whereas $\overleftarrow{p}(x_{1:t} \mid y_{1:t})$ describes the feedback term.
In the absence of feedback in the channel the input $x$ at time $t$ does not depend on the past
of the output up to time $t-1$, and the feedback factor reduces to $\overleftarrow{p}(x_{1:t} \mid y_{1:t}) = p(x_{1:t})$.

Mutual information is a divergence measure between the actual joint probability density
function and its factorized equivalent expression when independence holds. In order to
assess directionality, Massey suggests comparing the joint probability to the alternative
factorization $\overleftarrow{p}(x_{1:t} \mid y_{1:t})\, p(y_{1:t})$, which corresponds to a situation of no influence of $x$ onto
$y$ but of the existence of feedback from $y$ to $x$. A very simple example is given by $x_t =
\alpha x_{t-1} + \beta y_{t-1} + v_t$ and $y_t = \gamma y_{t-1} + w_t$ where $v_t$ and $w_t$ are white noises independent from
each other.

The directed information is defined as 



$$I(x_{1:t} \to y_{1:t}) = E\left[\log \frac{p(x_{1:t}, y_{1:t})}{\overleftarrow{p}(x_{1:t} \mid y_{1:t})\, p(y_{1:t})}\right] \qquad (6)$$



Comparing this definition with equation (2), it is observed that the difference lies in the
term $p(x_{1:t})$ which is replaced here by the term $\overleftarrow{p}(x_{1:t} \mid y_{1:t})$. This shows that the directed
information and mutual information will be equal when there is no feedback. The main
properties of the directed information are now summarised. In the sequel, the delay operator
$D : x_t \to x_{t-1}$ is denoted as $Dx_t$ for a signal and $Dx_{1:t} = (0, x_1, \ldots, x_{t-1}) = (0, x_{1:t-1})$
for a vector. Different proofs of the results presented hereafter exist, the simplest of which
relies on the use of Kullback-Leibler divergence properties. For detailed proofs, refer to
[2, 28, 48]. The properties are as follows.



1. The directed information is positive. 

2. The directed information is smaller than, or equal to the mutual information. 

3. Equality between the directed information and the mutual information occurs if and 
only if there is no feedback. 

4. The directed information decomposes as 

$$I(x_{1:t} \to y_{1:t}) + I(Dy_{1:t} \to x_{1:t}) = I(x_{1:t}; y_{1:t}) \qquad (7)$$

The first three points are fundamental from a communication point of view. Points 2 and 3
mean that mutual information overestimates the quantity of information flowing from one 
signal to another. This has been used by information theorists to provide closer bounds for 
the capacity of a channel with feedback. The third point ensures that directed information 
theory leads to the usual theory if there is no feedback. The last point is important as 
it shows how the information shared by two stochastic processes is decomposed into the 
sum of information flowing in opposite directions. A similar decomposition will be found 
in the sequel, in the framework of causal conditioning. The purpose of the next section is 
to provide appropriate definitions for causal conditioning and to open new perspectives for 
directed information. 
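The properties listed above can be verified on a toy case. The following sketch is an illustration under assumptions of our own: two binary processes observed over two time steps, with an arbitrary randomly drawn joint pmf. It uses the standard expansion of the directed information as a sum of conditional mutual informations, $I(x_{1:t} \to y_{1:t}) = \sum_i I(x_{1:i}; y_i \mid y_{1:i-1})$, and computes every term from joint entropies. The function names (`entropy`, `cmi`, `directed_information_terms`) are hypothetical helpers.

```python
import numpy as np

# Axes of the joint pmf p[x1, y1, x2, y2] (binary variables, two time steps).
X1, Y1, X2, Y2 = 0, 1, 2, 3

def entropy(p, axes):
    """Shannon entropy (in nats) of the marginal of p over the given axes."""
    axes = set(axes)
    if not axes:
        return 0.0
    drop = tuple(k for k in range(p.ndim) if k not in axes)
    q = (p.sum(axis=drop) if drop else p).ravel()
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

def cmi(p, a, b, c=()):
    """Conditional mutual information I(a; b | c) computed from joint entropies."""
    a, b, c = set(a), set(b), set(c)
    return (entropy(p, a | c) + entropy(p, b | c)
            - entropy(p, a | b | c) - entropy(p, c))

def directed_information_terms(p):
    """Return I(x -> y), I(Dy -> x) and I(x; y) for t = 2."""
    di_xy = cmi(p, [X1], [Y1]) + cmi(p, [X1, X2], [Y2], [Y1])
    # Dy = (0, y1): the term at i = 1 vanishes, only i = 2 contributes.
    di_dyx = cmi(p, [Y1], [X2], [X1])
    mi = cmi(p, [X1, X2], [Y1, Y2])
    return di_xy, di_dyx, mi

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()                      # arbitrary joint pmf, generically with feedback
di_xy, di_dyx, mi = directed_information_terms(p)
print("I(x -> y)  =", di_xy)
print("I(Dy -> x) =", di_dyx)
print("I(x; y)    =", mi, " sum of the two directed terms =", di_xy + di_dyx)
```

Since a randomly drawn pmf generically contains feedback, the directed information comes out strictly smaller than the mutual information, and the two directed terms add up to it as stated in property 4.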

B. Causal conditioning, causal conditional directed information 

An alternative formulation for directed information may be easily obtained:

$$I(x_{1:t} \to y_{1:t}) = \sum_{i=1}^{t} I(x_{1:i}; y_i \mid y_{1:i-1}), \qquad (8)$$

where $I(x; y \mid z)$ is the conditional mutual information between $x$ and $y$ given $z$. Directed
information may also be expressed as a function of Shannon entropies as

$$I(x_{1:t} \to y_{1:t}) = H(y_{1:t}) - \sum_{i=1}^{t} H(y_i \mid x_{1:i}, y_{1:i-1}). \qquad (9)$$

This expression should be compared to the expression of mutual information below

$$I(x_{1:t}; y_{1:t}) = H(y_{1:t}) - \sum_{i=1}^{t} H(y_i \mid x_{1:t}, y_{1:i-1}). \qquad (10)$$




It appears that the only difference lies in the time horizon over which the conditioning is
performed in the conditional entropy. For the mutual information, conditioning is performed
for each time over the whole observation of $x$. For the directed information, conditioning
for the term at time $i$ is performed from the time origin up to time $i$. Kramer suggested
referring to this conditioning as 'causal conditioning'. We keep the same name but propose
a slightly different presentation for it. Causal conditional entropy is defined as

$$H(y_{1:t} \| x_{1:t}) = -E[\log \overrightarrow{p}(y_{1:t} \mid x_{1:t})]. \qquad (11)$$

It quantifies the information that remains when observing $y$ once $x$ has been causally
observed. The directed information is then recovered by subtracting the latter quantity from
the entropy of $y$:

$$I(x_{1:t} \to y_{1:t}) = H(y_{1:t}) - H(y_{1:t} \| x_{1:t}). \qquad (12)$$

Causal conditioning and usual conditioning can be mixed. Kramer proposes the following
rule: when reading from left to right, the first type of conditioning is applied. Thus,
according to this rule, we define

$$H(y_{1:t} \mid x_{1:t} \| z_{1:t}) = H(y_{1:t}, x_{1:t} \| z_{1:t}) - H(x_{1:t} \| z_{1:t}) \qquad (13)$$

$$H(y_{1:t} \| x_{1:t} \mid z_{1:t}) = \sum_{i=1}^{t} H(y_i \mid y_{1:i-1}, x_{1:i}, z_{1:t}) \qquad (14)$$

These two definitions highlight a non commutative property between classical and causal
conditioning. In eq. (13), the definition is similar to the definition of usual conditional
entropy as the difference between the joint entropy of $x$ and $y$ and the entropy of $x$ alone.
In eq. (14), the conditioning on $z$ is global (compared to the conditioning on $x$ which is
causal). In that sense, in this definition, the conditioning variable $z$ is not necessarily a
signal synchronous to signals $x$ and $y$. By contrast, eq. (13) does not make sense if $z_t$ is not
synchronous with $x_t$ and $y_t$.

Finally, a causal conditional directed information can be defined. Mimicking the definition
of conditional mutual information ($I(x; y \mid z) = H(y \mid z) - H(y \mid x, z)$), causal conditional
directed information is defined as

$$I(x_{1:t} \to y_{1:t} \| z_{1:t}) = H(y_{1:t} \| z_{1:t}) - H(y_{1:t} \| x_{1:t}, z_{1:t}) = \sum_{i=1}^{t} I(x_{1:i}; y_i \mid y_{1:i-1}, z_{1:i}). \qquad (15)$$



This quantity will be of crucial importance when dealing with multivariate time series.
Furthermore, it appears in the sum of two directed information quantities flowing in opposite
directions. Actually, it can be shown that

$$I(x_{1:t} \to y_{1:t}) + I(y_{1:t} \to x_{1:t}) = I(x_{1:t}; y_{1:t}) + I(x_{1:t} \to y_{1:t} \| Dx_{1:t}). \qquad (16)$$

In this expression, the term $I(x_{1:t} \to y_{1:t} \| Dx_{1:t})$ is named instantaneous exchange information
and can be written as

$$I(x_{1:t} \to y_{1:t} \| Dx_{1:t}) = \sum_{i=1}^{t} I(x_{1:i}; y_i \mid y_{1:i-1}, x_{1:i-1}) \qquad (17)$$

$$= \sum_{i=1}^{t} I(x_i; y_i \mid y_{1:i-1}, x_{1:i-1}). \qquad (18)$$

The last equation is obtained since, once $x_{1:i-1}$ is given, the only component of $x_{1:i}$ left
free is $x_i$. Furthermore, this equation illustrates that the instantaneous information
exchange is symmetrical in the signals $x$ and $y$.

The importance of instantaneous information exchange appears also in the following
decomposition of the causal conditional directed information. Recall the following chain
rule for the conditional mutual information [9]

$$I(x, y; z \mid w) = I(x; z \mid w) + I(y; z \mid w, x). \qquad (19)$$

Applying it to $I(x_{1:i}; y_i \mid y_{1:i-1}, z_{1:i})$ leads to

$$I(x_{1:t} \to y_{1:t} \| z_{1:t}) = \sum_{i=1}^{t} I(x_{1:i-1}; y_i \mid y_{1:i-1}, z_{1:i}) + \sum_{i=1}^{t} I(x_i; y_i \mid x_{1:i-1}, y_{1:i-1}, z_{1:i}) \qquad (20)$$

$$= I(Dx_{1:t} \to y_{1:t} \| z_{1:t}) + I(x_{1:t} \to y_{1:t} \| Dx_{1:t}, z_{1:t}).$$

Here, the second term is the instantaneous information exchange causally conditioned by
the third time series $z$. Likewise, the decomposition holds for the directed information

$$I(x_{1:t} \to y_{1:t} \| z_{1:t}) = I(Dx_{1:t} \to y_{1:t} \| z_{1:t}) + I(x_{1:t} \to y_{1:t} \| Dx_{1:t}, z_{1:t}). \qquad (21)$$
C. Rates for stationary processes



All definitions introduced above make sense for processes that evolve within a finite
dimensional phase space. Extending these definitions to the study of stochastic processes
requires some care. Actually the information related quantities (such as entropy) are extensive.
If a stochastic process visits a phase space whose dimension increases with $t$, information
quantities often diverge linearly as a function of time. Thus it makes sense to introduce
information rates, as defined below; these definitions extend the classical rates found in the
literature:

$$I_\infty(x; y) = \lim_{t \to +\infty} \frac{1}{t} I(x_{1:t}; y_{1:t}) \qquad (22)$$

$$I_\infty(x \to y) = \lim_{t \to +\infty} \frac{1}{t} I(x_{1:t} \to y_{1:t}) \qquad (23)$$

$$I_\infty(x \to y \| z) = \lim_{t \to +\infty} \frac{1}{t} I(x_{1:t} \to y_{1:t} \| z_{1:t}). \qquad (24)$$

All limits are assumed to exist, and the previous quantities are named mutual information 
rate, directed information rate and causal conditional directed information rate, respectively. 
A fundamental result allows a simpler expression of the rates when the processes are jointly 
stationary. When dealing with discrete valued processes (and with slightly more involvement, 
continuous random processes), one can establish that, assuming stationarity, the directed 
information rates can be written as 

$$I_\infty(x \to y) = \lim_{t \to +\infty} I(x_{1:t}; y_t \mid y_{1:t-1}) \qquad (25)$$

$$I_\infty(x \to y \| z) = \lim_{t \to +\infty} I(x_{1:t}; y_t \mid y_{1:t-1}, z_{1:t}). \qquad (26)$$



A proof of the first equality may be found in [28]; a proof for the second equality can
be derived by following the same lines. Extending these equalities to continuous random
processes relies upon the tools developed in [21, 22, 40]. These equalities extend the famous
result for the entropy rate

$$\lim_{t \to +\infty} \frac{1}{t} H(x_{1:t}) = \lim_{t \to +\infty} H(x_t \mid x_{1:t-1}). \qquad (27)$$
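The entropy rate identity can be illustrated on a case where both sides are computable exactly. The sketch below assumes a stationary Gaussian AR(1) process $x_t = a x_{t-1} + e_t$ with unit-variance innovations (a choice made here for illustration, not an example taken from the paper), whose covariance is $r(k) = a^{|k|}/(1-a^2)$, and uses the differential entropy of a Gaussian vector. The conditional entropy $H(x_t \mid x_{1:t-1})$ is obtained as $H(x_{1:t}) - H(x_{1:t-1})$. Function names are hypothetical.

```python
import numpy as np

def ar1_covariance(t, a):
    """Covariance matrix of t consecutive samples of a stationary AR(1) process."""
    idx = np.arange(t)
    return a ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - a ** 2)

def gaussian_entropy(cov):
    """Differential entropy (nats) of a zero-mean Gaussian vector with covariance cov."""
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (cov.shape[0] * np.log(2 * np.pi * np.e) + logdet)

def entropy_rate_estimates(t, a):
    """Return (H(x_{1:t}) / t, H(x_t | x_{1:t-1})) for the AR(1) model."""
    h_t = gaussian_entropy(ar1_covariance(t, a))
    h_prev = gaussian_entropy(ar1_covariance(t - 1, a))
    return h_t / t, h_t - h_prev

a = 0.9
limit = 0.5 * np.log(2 * np.pi * np.e)   # entropy of the unit-variance innovation
for t in (2, 10, 100, 1000):
    averaged, conditional = entropy_rate_estimates(t, a)
    # Print the distance of each side of the identity to the common limit.
    print(t, averaged - limit, conditional - limit)
```

In this model the conditional form reaches the limit as soon as one past sample is available, whereas the time-averaged form approaches it only at a 1/t pace; this difference in convergence speed is one reason why the conditional expressions are convenient in practice.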
Interestingly, applying the preceding results to the decomposition of the directed information
in eq. (21) leads to

$$I_\infty(x \to y) = \lim_{t \to +\infty} I(x_{1:t-1}; y_t \mid y_{1:t-1}) + \lim_{t \to +\infty} I(x_t; y_t \mid x_{1:t-1}, y_{1:t-1}) \qquad (28)$$

$$= I_\infty(Dx \to y) + I_\infty(x \to y \| Dx), \qquad (29)$$

where $I_\infty(x \to y \| Dx)$ is the instantaneous information exchange rate. The other term is
the limit of $I(x_{1:t-1}; y_t \mid y_{1:t-1})$, which is a particular instance of Schreiber's transfer entropy
[43]. We thus name $I_\infty(Dx \to y)$ the transfer entropy rate. This result, already
mentioned in [2], makes it possible to recast all results and approaches found in the literature within a
unique and simplified framework. Further, it highlights the fact that stationarity is implicitly
present in Schreiber's intuition, and that instantaneous information exchange between
processes is lacking in his work. The decomposition can be easily done for the conditional
rates, and leads to

$$I_\infty(x \to y \| z) = I_\infty(Dx \to y \| z) + I_\infty(x \to y \| Dx, z). \qquad (30)$$

This provides an implicit definition of conditional transfer entropy rate and conditional in- 
stantaneous information exchange rate. Furthermore, let us mention that in all the preceding 
discussion, the conditioning process z can be a multivariate process. We are now ready to 
link directed information theory and Granger causality graphs. 
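To make the decomposition into a transfer entropy rate and an instantaneous information exchange rate concrete, the following sketch estimates both terms for a simulated pair of processes. It relies on assumptions that are ours, not the paper's: the processes are jointly Gaussian, so that each conditional mutual information equals half the logarithm of a ratio of linear prediction error variances, and the infinite past is truncated to p = 5 lags. The model coefficients, the noise correlation rho = 0.6 and all function names are arbitrary illustrative choices.

```python
import numpy as np

def simulate(n, rho=0.6, seed=0):
    """x_t = 0.5 x_{t-1} + v_t, y_t = 0.5 y_{t-1} + 0.4 x_{t-1} + w_t, corr(v_t, w_t) = rho."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    w = rho * v + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
    x, y = np.zeros(n), np.zeros(n)
    for t in range(1, n):
        x[t] = 0.5 * x[t - 1] + v[t]
        y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + w[t]
    return x, y

def lags(s, p, first=1):
    """Matrix whose rows are (s[t-first], ..., s[t-p]) for t = p, ..., n-1."""
    n = len(s)
    return np.column_stack([s[p - k:n - k] for k in range(first, p + 1)])

def residual_variance(target, blocks):
    """Mean squared error of the least-squares prediction of target from blocks."""
    design = np.column_stack([np.ones(len(target))] + blocks)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.mean((target - design @ coef) ** 2)

def transfer_and_instantaneous(x, y, p=5):
    """Gaussian estimates of the two terms of the decomposition for the direction x -> y."""
    target = y[p:]
    v_own = residual_variance(target, [lags(y, p)])                        # past of y only
    v_past = residual_variance(target, [lags(y, p), lags(x, p)])           # plus past of x
    v_inst = residual_variance(target, [lags(y, p), lags(x, p, first=0)])  # plus present of x
    return 0.5 * np.log(v_own / v_past), 0.5 * np.log(v_past / v_inst)

x, y = simulate(20000)
print("x -> y (transfer, instantaneous):", transfer_and_instantaneous(x, y))
print("y -> x (transfer, instantaneous):", transfer_and_instantaneous(y, x))
```

In this model only x drives y, so the transfer term is positive from x to y and close to zero in the other direction, while the instantaneous term is the same in both directions, reflecting its symmetry; its value should be near $-\tfrac{1}{2}\log(1-\rho^2)$.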



IV. CAUSAL INFORMATION MEASURES TO INFER GRANGER CAUSALITY 
GRAPHS 

When confronted with a multidimensional time series, a fundamental question is to study
its dependence structure. The approach investigated here consists of inferring a graphical
model underlying the process that is able to account for causal relationships. A good candidate
for such a model is a Granger causality graph [11]. Let $X_t$ be the random multivariate
process of interest, and $x_1$, $x_2$ two of its components. Recall that in a Granger causality
graph that models a multivariate process $X_t$, the absence of a directed edge from node $x_1$
to node $x_2$ is equivalent to the conditional independence expressed by

$$P(x_{2,t} \mid x_{1,1:t-1}, x_{2,1:t-1}, X\setminus\{x_1, x_2\}_{1:t}) = P(x_{2,t} \mid x_{2,1:t-1}, X\setminus\{x_1, x_2\}_{1:t}). \qquad (31)$$

Similarly, the absence of an undirected edge expresses the equality

$$P(x_{2,t} \mid x_{1,1:t}, x_{2,1:t-1}, X\setminus\{x_1, x_2\}_{1:t}) = P(x_{2,t} \mid x_{1,1:t-1}, x_{2,1:t-1}, X\setminus\{x_1, x_2\}_{1:t}). \qquad (32)$$

In these expressions $X\setminus\{x_1, x_2\}$ stands for the multivariate process $X$ without components
$x_1$ and $x_2$.

The problem of inferring a graph from the observed data can then be viewed as a problem
of assessing Granger causality between ordered pairs of nodes, say $x$ and $y$. This is done
relative to the remaining nodes of the graph that form the additional observed process
$X\setminus\{x_1, x_2\}$.

In view of the previous definitions, we need measures to assess conditional independence 
on the past and conditional independence between present samples. Such measures were 
defined in the previous sections, within an information theoretic framework. We can now 
state the main results of the paper: 

Let $(V, E_d, E_u)$ be the Granger causality graph of a multivariate process $X_t$. Then

1. $\forall i \in V, j \in V$, $(i,j) \notin E_d \iff I_\infty(Dx_i \to x_j \| X\setminus\{x_i, x_j\}) = 0$

2. $\forall i \in V, j \in V$, $(i,j) \notin E_u \iff I_\infty(x_i \to x_j \| Dx_i, X\setminus\{x_i, x_j\}) = 0$.

To state it differently, we have the two following assertions:

• Conditional transfer entropy rate is a well adapted measure in order to assess Granger 
causality between two nodes with respect to the remaining available set of observations. 

• Conditional instantaneous information exchange rate quantifies the instantaneous 
causality between two nodes relative to the other observed time series (recalling that 
each node of the graph accounts for a time series). 

As a corollary, we can state that there is no edge (directed or undirected) between two nodes
$i$ and $j$ if and only if the causal conditional directed information rate $I_\infty(x_i \to x_j \| X\setminus\{x_i, x_j\})$
is equal to zero.
These assertions were proven in a previous work for the simpler case of Gaussian processes
[2, 3]. In jg] for the case of bivariate Gaussian processes, the author establishes that transfer
entropy can be used to assess Granger causality. However, instantaneous causality is not
mentioned by these authors. A sketch of a proof for the general case is given below.

Firstly, let $x$ and $y$ be two processes such that $x$ does not cause $y$ relative to a third
multivariate process $X$ (which does not contain $x$ nor $y$). Testing Granger causality relies
upon a Markov chain dependence model $x_{1:t-1} \to y_{1:t-1} \to y_t$ where all dependence is
considered conditioned on $X_{1:t}$. According to the assumption '$x$ does not cause $y$', we have
$I(x_{1:t-1}; y_t \mid y_{1:t-1}, X_{1:t}) = 0$. Therefore, the sum of such terms in equation (20) equals zero
as well. This allows us to assert that for processes that are not 'Granger causally' related,
the conditional transfer entropy rate is zero.

Conversely, if the rate is zero, since it is defined as the limit of a sum of nonnegative terms,
each individual term is necessarily equal to zero. Then since conditional independence
is equivalent to the nullity of the corresponding conditional mutual information, we may
conclude that the processes are not 'Granger causally' related.

The second assertion is shown in the same way. 
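The first assertion suggests a simple procedure for recovering the directed edges of a Granger causality graph: estimate the conditional transfer entropy for every ordered pair of nodes, conditioning on the remaining series, and keep the pairs where it is not zero. The sketch below illustrates this under assumptions that are not from the paper: Gaussian data (so that a linear regression gives the estimate), a past truncated to p = 3 lags, and a fixed threshold of 0.05 nats used in place of a proper statistical test. The simulated model is the three-process loop $x \to y \to z \to x$ with all coefficients set to 0.8; function names are hypothetical.

```python
import numpy as np

def simulate_loop(n, coef=0.8, seed=0):
    """Rows 0, 1, 2 hold x, y, z with x_t <- z_{t-1}, y_t <- x_{t-1}, z_t <- y_{t-1}."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((3, n))
    X = np.zeros((3, n))
    for t in range(1, n):
        X[0, t] = coef * X[2, t - 1] + noise[0, t]
        X[1, t] = coef * X[0, t - 1] + noise[1, t]
        X[2, t] = coef * X[1, t - 1] + noise[2, t]
    return X

def lags(s, p, first):
    """Matrix whose rows are (s[t-first], ..., s[t-p]) for t = p, ..., n-1."""
    n = len(s)
    return np.column_stack([s[p - k:n - k] for k in range(first, p + 1)])

def residual_variance(target, blocks):
    """Mean squared error of the least-squares prediction of target from blocks."""
    design = np.column_stack([np.ones(len(target))] + blocks)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.mean((target - design @ coef) ** 2)

def conditional_transfer_entropy(X, i, j, p=3):
    """Gaussian estimate of the transfer entropy from X[i] to X[j] given the other rows."""
    others = [k for k in range(len(X)) if k not in (i, j)]
    # Own past of the target, plus past AND present of the remaining series.
    base = [lags(X[j], p, 1)] + [lags(X[k], p, 0) for k in others]
    target = X[j][p:]
    v_without = residual_variance(target, base)
    v_with = residual_variance(target, base + [lags(X[i], p, 1)])
    return 0.5 * np.log(v_without / v_with)

def directed_edges(X, p=3, threshold=0.05):
    """Ordered pairs (i, j) whose estimated conditional transfer entropy exceeds threshold."""
    m = len(X)
    return {(i, j) for i in range(m) for j in range(m)
            if i != j and conditional_transfer_entropy(X, i, j, p) > threshold}

X = simulate_loop(20000)
print(sorted(directed_edges(X)))   # expected to recover the loop x -> y -> z -> x
```

The threshold is an ad hoc device; with real data, the decision that a measure is zero calls for a test whose null distribution accounts for the estimator's bias and variance.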



V. DISCUSSION 



In this paper, we establish that Granger causality graphs can be obtained using directed 
information measures. The emphasis was put on adapted tools for investigating Granger 
causal relationships, namely the conditional transfer entropy rate and the conditional instan- 
taneous information exchange rate. Interestingly, the sum of these two measures constitutes 
the causal conditional directed information rate. 

We illustrated that directed information theory may be thought of as a fundamental
extension of information theory, especially in the case of neuroscience applications. Actually,
feedback is a fundamental ingredient for modeling and studying brain structures at
all scales. Directed information, as it is presented here, is shown to be an effective tool to 
assess connectivity in the brain. It will have fundamental applications in understanding the 
processing of information and/or coding information in the brain. 

Although these results are satisfactory from a theoretical point of view, some difficulties 
remain when it comes to developing practical estimators for the different information-related 
quantities introduced so far. The remainder of this section is devoted to discussing some practical 
implementation issues related to the inference of a Granger causality graph. 

Firstly, we have to assume ergodicity and stationarity of the signals if we want to estimate 
the information rates from a single realization of the multivariate process. Stationarity 
also simplifies the analysis, since it simplifies the definition 
of information rates. In the case of real neural data, stationarity is usually 
satisfied over certain time scales only (it is thus highly context dependent). Ergodicity 
is required because otherwise time averages cannot replace ensemble 
averages, which may lead to severe practical difficulties when evaluating statistical quantities. 

Secondly, rates are defined as limits and in general cannot be evaluated exactly. It is thus usual 
to introduce a finite length observation window, over which the information measures are 
evaluated. However, this approach replaces limits by finite size samples and does not 
guarantee that the initial conditions are forgotten; it may introduce some systematic bias in 
the analysis, as illustrated for example in [2] for the case of information flows between the 
components of two dimensional AR(1) processes. Once the limitation to finite size samples 
has been accepted, the conditional mutual information quantities involved have 
to be estimated. Many estimators can be applied. Although we will not describe here 
the wealth of the mutual information literature (interested readers may find interesting reviews 
in [...] and references therein), it is worth mentioning recent promising works on 
the use of k-nearest neighbors to estimate entropies and (conditional) mutual information 
[..., 23, 29, 32, 51]. One of the most attractive features of these techniques lies in the 
fact that they are almost free of parameters like bin sizes or kernel widths. This makes it possible 
to tackle a wide variety of situations, ranging from continuous valued processes to point 
processes, as illustrated in [50]. However, some drawbacks include the computational burden 
and the absence of theoretical results for the rate of convergence. Nevertheless, extensive 
Monte-Carlo simulations have shown the good behavior of these estimators in moderate 
dimensions (up to 5 or 6) [4], [...], [29]. Let us also mention an ingenious trick explained 
in [16], which consists, for the conditional mutual information I(x;y|z), in conditioning on 
the time samples of z that share as much information as possible with x. This effectively 
reduces the dimension. Another rarely considered difficulty lies in the different 
natures and properties encountered in neural data. As outlined in the introduction, neural 
data may behave as point processes, exhibit some long range dependencies and are often 
non-stationary. These properties (and lack of properties) make the estimation issue very difficult, 
and the estimation of information measures, despite a lot of beautiful work, remains a 
challenging field of research. In this respect, prospective work may concern the use of 
approximate measures based on Gram-Charlier or Edgeworth expansions of the densities. 
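
To make the finite-window step discussed above concrete, here is a minimal sketch of a plug-in (histogram) estimate of the transfer entropy $I(x_{t-k:t-1}; y_t \mid y_{t-k:t-1})$ for discrete-valued sequences. It is not one of the k-nearest-neighbor estimators cited in the text; the function name `transfer_entropy`, the window length `k` and the toy data are assumptions made for illustration, and no side process is conditioned on. 

```python
import math
import random
from collections import Counter


def transfer_entropy(x, y, k=1):
    """Plug-in estimate, in nats, of I(x_{t-k:t-1}; y_t | y_{t-k:t-1}).

    x and y are equal-length sequences of hashable (discrete) symbols.
    The infinite past is truncated to a window of k samples, which is
    exactly the finite-window approximation discussed in the text.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(y) - k
    if n <= 0:
        raise ValueError("sequences are too short for this window length")

    # Joint counts of (present of y, past of y, past of x).
    joint = Counter()
    for t in range(k, len(y)):
        joint[(y[t], tuple(y[t - k:t]), tuple(x[t - k:t]))] += 1

    # Marginal counts needed for the two conditional distributions.
    c_past = Counter()      # (y_past, x_past)
    c_y = Counter()         # (y_t, y_past)
    c_ypast = Counter()     # y_past
    for (yt, yp, xp), c in joint.items():
        c_past[(yp, xp)] += c
        c_y[(yt, yp)] += c
        c_ypast[yp] += c

    te = 0.0
    for (yt, yp, xp), c in joint.items():
        # p(y_t | y_past, x_past) / p(y_t | y_past), from empirical counts.
        ratio = c * c_ypast[yp] / (c_past[(yp, xp)] * c_y[(yt, yp)])
        te += (c / n) * math.log(ratio)
    return te


# Toy data (assumption): y is a copy of x delayed by one sample,
# so information flows from x to y but not from y to x.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(5000)]
y = [0] + x[:-1]
te_xy = transfer_entropy(x, y, k=1)  # expected to be close to log(2)
te_yx = transfer_entropy(y, x, k=1)  # expected to be close to 0
```

The plug-in estimate is non-negative by construction and therefore biased upwards when the true value is zero. This is one reason why a threshold, rather than a comparison with zero, is needed at the detection stage. 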



The second issue met in practice is the detection issue: assuming that an estimate of some 
information rate related measure is available, it must be decided whether an edge exists or 
not within the graph. This is a classical problem of statistical testing theory, for which the 
empirical information rate serves as a test statistic. Theoretically, if it is zero, no edge is 
placed between the nodes of interest. As the estimate will in practice never be exactly zero, we have 
to choose a threshold above which the measure is declared significantly nonzero. The 
most popular approach to solve this problem is due to Neyman and Pearson, and consists 
of optimizing the test under the constraint that the probability of a false positive decision (making the 
wrong decision that an edge exists) remains below some chosen constant value, referred to 
as the test 'significance level'. 

Of course the level is a probability, and evaluating its value requires knowledge of the 
probability density function of the estimated information rate (serving as the test statistic 
here) under the null hypothesis. Since the test statistic used is a very complicated nonlinear 
transform of the data, this probability density is rarely known. But the thresholds to apply 
can be evaluated by using bootstrapping strategies, surrogate data or random permutations 
[18]. This is of course only possible at the expense of an increase in computational load. 
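
As an illustration of the random permutation strategy, the sketch below estimates a p-value for any dependence statistic by recomputing it on shuffled copies of the source sequence. Everything here is an assumption made for the example: the helper names `lagged_mi` and `permutation_p_value`, the choice of a simple lagged mutual information as the statistic, and the toy data. Plain shuffling also destroys the temporal structure of the source itself, so for strongly autocorrelated data other surrogates (for instance block shuffling) may be more appropriate. 

```python
import math
import random
from collections import Counter


def lagged_mi(x, y):
    """Plug-in estimate, in nats, of I(x_{t-1}; y_t) for discrete sequences."""
    pairs = list(zip(x[:-1], y[1:]))
    n = len(pairs)
    c_xy = Counter(pairs)
    c_x = Counter(a for a, _ in pairs)
    c_y = Counter(b for _, b in pairs)
    return sum(c / n * math.log(c * n / (c_x[a] * c_y[b]))
               for (a, b), c in c_xy.items())


def permutation_p_value(stat, x, y, n_perm=200, seed=0):
    """Estimate P(stat >= observed) under the null 'x and y are unrelated'.

    The null distribution is approximated by shuffling x, which breaks any
    dependence between x and y while keeping the marginal of x unchanged.
    """
    rng = random.Random(seed)
    observed = stat(x, y)
    surrogate = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(surrogate)
        if stat(surrogate, y) >= observed:
            exceed += 1
    # The +1 terms count the observed value as one draw from the null,
    # so the estimated p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


# Toy data (assumption): y copies x with a one-sample delay.
rng = random.Random(1)
x = [rng.randint(0, 1) for _ in range(400)]
y = [0] + x[:-1]
p_value = permutation_p_value(lagged_mi, x, y)
edge_detected = p_value <= 0.05  # 0.05 plays the role of the significance level
```

The cost is `n_perm` additional evaluations of the statistic per candidate edge, which is the increase in computational load mentioned above. 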



Finally, the last problem at hand is that of multiple testing, which must be correctly handled. 
It is known that when multiple testing is performed, as is the case when deciding the presence 
of edges between multiple pairs of nodes, controlling the level of the test is not easy [31]. 
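
One standard way of handling many simultaneous edge tests is the Benjamini-Hochberg step-up procedure, which controls the false discovery rate rather than the level of each individual test. The sketch below is given only as an example of such a correction; it is not claimed to be the approach of the reference cited in the text, and the function name `benjamini_hochberg` and the p-values are hypothetical. 

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Benjamini-Hochberg step-up procedure at false discovery rate q:
    sort the m p-values, find the largest rank r such that
    p_(r) <= q * r / m, and reject the r smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank  # keep the largest rank that satisfies the bound
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for four candidate edges of a graph.
edges = ["a->b", "a->c", "b->c", "c->a"]
p_values = [0.001, 0.04, 0.03, 0.8]
kept = [e for e, r in zip(edges, benjamini_hochberg(p_values)) if r]
```

A more conservative alternative is the Bonferroni correction, which compares each p-value with the significance level divided by the number of tests. 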